Assignment 2

Author

Nathan Lindley

Published

September 15, 2022

1 Article

1.1 Article Link

A Beginner’s Guide to Data Engineering — Part I

1.2 Summary of Article

This article argues that most of a data scientist’s day-to-day work is data engineering and cleaning. The author explains that many young data scientists (themselves included) aspired to build large, elaborate projects and visualizations, but ended up doing the more mundane and less glamorous task of data cleaning. Much of the available curriculum teaches us to work with data that has already been cleaned and made easy to use. In contrast, the real world is full of sensor data and other ‘raw’ forms of data that must be cleaned before they can be worked with.

1.3 About the Author

Robert Chang has a master’s degree in statistics and works as a data scientist at Airbnb. He previously worked as a data scientist at Twitter.

1.4 Image From Article

1.5 More Information

  • Most of a data scientist’s day-to-day work is data engineering and cleaning.
  • There is much more data cleaning than data science in the real world.
  • Three main steps of data engineering: Extract, Transform, Load (ETL)
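
To make the three ETL steps concrete, here is a minimal sketch in base R. It is not taken from the article: the sensor readings, the column names (`sensor_id`, `temp_f`), and the `-999` "bad reading" sentinel are all made up for illustration, and a real pipeline would read from and write to actual data stores rather than a string and a temporary file.

```r
# Hypothetical raw sensor readings; the column names and values are invented.
raw_text <- "sensor_id,temp_f
A,68.0
B,NA
A,71.6
C,-999"

# Extract: read the raw data from its source (here, an in-memory string).
raw <- read.csv(text = raw_text, stringsAsFactors = FALSE)

# Transform: drop missing readings and the assumed -999 sentinel,
# then convert Fahrenheit to Celsius.
clean <- raw[!is.na(raw$temp_f) & raw$temp_f != -999, ]
clean$temp_c <- round((clean$temp_f - 32) * 5 / 9, 1)
clean$temp_f <- NULL

# Load: write the cleaned table to its destination (here, a temporary CSV file).
out_path <- tempfile(fileext = ".csv")
write.csv(clean, out_path, row.names = FALSE)
```

Only the two valid readings from sensor A survive the transform step, which mirrors the article's point that raw data usually needs cleaning before any analysis can start.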

2 Plot

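library(tidyverse)  # attaches ggplot2, so a separate library(ggplot2) call is not needed
library(plotly)

# Box plot of extra sleep by group, using R's built-in `sleep` dataset
sleep_plot <- ggplot(sleep, aes(x = group, y = extra)) +
  geom_boxplot() +
  labs(title = "Extra Sleep by Group", x = "Group", y = "Extra Sleep (hours)") +
  theme(plot.title = element_text(hjust = 0.5))  # center the title

# Convert the static ggplot into an interactive plotly widget
ggplotly(sleep_plot)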